TTU-WSU: GBAD

VAST 2009 Challenge

Challenge 1: Badge and Network Traffic

Authors and Affiliations:

· Jeffrey Graves, Tennessee Tech University, jagraves21@tntech.edu

· William Eberle, Tennessee Tech University, weberle@tntech.edu [PRIMARY contact]

· Lawrence Holder, Washington State University, holder@wsu.edu

Tool(s):

To analyze the badge and network traffic, we used the Graph-Based Anomaly Detection (GBAD) tool to focus the visualization on interesting structural anomalies. Initially created in 2006 as a joint venture between the University of Texas at Arlington and Washington State University, GBAD discovers anomalous instances of structural patterns in data, where the data represents entities, relationships and actions in graph form. Input to GBAD is a labeled graph in which entities are represented by labeled vertices, and relationships or actions are represented by labeled edges between entities. GBAD uses the minimum description length (MDL) principle to identify the normative pattern, that is, the pattern that minimizes the number of bits needed to describe the input graph after it has been compressed by the pattern. It then applies novel algorithms for identifying the three possible changes to a graph: modifications, insertions and deletions. Each algorithm discovers the substructures that most closely match the normative pattern without matching it exactly. As a result, GBAD looks for activities that appear to match normal patterns but are in fact structurally different. GBAD is a Unix-based tool written in C, and uses the SUBDUE graph-based data mining system (www.subdue.org) as the engine for discovering the normative pattern in a graph. GBAD was developed by William Eberle and Lawrence Holder.

Video:

GBAD video

ANSWERS:

MC1.1: Identify which computer(s) the employee most likely used to send information to his contact in a tab-delimited table which contains for each computer identified: when the information was sent, how much information was sent and where that information was sent.

Traffic

MC1.2: Characterize the patterns of behavior of suspicious computer use.

To analyze the badge and network traffic of employees, we used our Graph-Based Anomaly Detection (GBAD) system. GBAD takes a graph representation of the data and applies three algorithms that analyze the graph for structural anomalies. Each of these algorithms is applied after the normative graph structure has been discovered. Our hypothesis is that such a system can discover knowledge in a graph representation of the badge and network traffic data that will (1) show the normal structure of employee movements and network activity, and (2) show anomalies in employee behavior that indicate a possible insider threat.

In order to answer the challenge, we decided to focus on the movements and locations of the employees, along with their connections to the network. Based upon all of the information that was provided with the challenge, we made the following assumptions about this particular data set:

· Any employee can piggyback from one area to another (or in from the outside). In other words, nobody is required to use their badge, as long as someone else will open the door for them.

· No employee used a computer that was not assigned to them, for fear of discovery (or termination).

· No employee spent the night at the facility. If they started the day inside the building, they must have piggybacked behind another employee to gain entrance.

Starting with these simple assumptions, we created graphs based upon the movement of employees between areas (outside, building, classified) and the number of connections that were made by the employee each time they were in the building, where vertices represented locations and network connections, and edges indicated order of movements.

The graph topology is chosen manually, as the choice of an appropriate topology is domain dependent. For this mini-challenge, our graphs consisted of subgraphs that each represented one employee's movements for a particular day. Each subgraph contained a backbone of movement vertices. Attached to each movement vertex were two vertices representing where the person started and ended (i.e., outside, building, classified); these edges were labeled start and end. If network traffic was sent before the person moved again, a network vertex was created and linked to the movement vertex via a sends edge. The network vertex was also linked to a vertex with a numerical label, representing how many messages were sent before the next movement occurred. Also attached to each movement vertex, via a time edge, was a vertex representing the time of day reported in the proximity log (early_morning 0:00-7:59, morning 8:00-11:59, after_noon 12:00-16:59, evening 17:00-20:59, night 21:00-23:59). A numerical vertex representing the hour was connected to the time vertex via an hour edge.
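The topology described above can be sketched in Python as follows. This is an illustrative sketch only, not the script used for the submission: the function names `time_of_day` and `build_day_subgraph`, the input tuple layout, and the `count` and `next` edge labels are assumptions, since the text does not name the edge that carries the message count or the edge that orders movements.

```python
def time_of_day(hour):
    """Map an hour (0-23) to the time-of-day label used in the text."""
    if not 0 <= hour <= 23:
        raise ValueError(f"hour out of range: {hour}")
    if hour < 8:
        return "early_morning"
    if hour < 12:
        return "morning"
    if hour < 17:
        return "after_noon"
    if hour < 21:
        return "evening"
    return "night"


def build_day_subgraph(movements):
    """Build one employee-day subgraph.

    movements: ordered list of (hour, start, end, messages_sent) tuples
    (an assumed layout). Returns (vertices, edges), where vertices is a
    list of labels (vertex id = list index + 1) and edges is a list of
    (source_id, target_id, label) tuples.
    """
    vertices = []
    edges = []

    def add_vertex(label):
        vertices.append(str(label))
        return len(vertices)

    previous_move = None
    for hour, start, end, messages_sent in movements:
        move = add_vertex("movement")
        edges.append((move, add_vertex(start), "start"))
        edges.append((move, add_vertex(end), "end"))
        time_vertex = add_vertex(time_of_day(hour))
        edges.append((move, time_vertex, "time"))
        edges.append((time_vertex, add_vertex(hour), "hour"))
        if messages_sent > 0:
            network = add_vertex("network")
            edges.append((move, network, "sends"))
            # "count" is an assumed label for the message-count edge.
            edges.append((network, add_vertex(messages_sent), "count"))
        if previous_move is not None:
            # "next" is an assumed label for the backbone ordering edge.
            edges.append((previous_move, move, "next"))
        previous_move = move
    return vertices, edges


# A hypothetical day: enter the building at 7, send 2 messages,
# enter the classified area at 8, leave it at 9.
day = [
    (7, "outside", "building", 2),
    (8, "building", "classified", 0),
    (9, "classified", "building", 0),
]
vertices, edges = build_day_subgraph(day)
```

Each movement gets its own copies of the location, time and hour vertices, so every employee-day forms a separate subgraph that can be compared structurally with the others.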

Figure 1. Example subgraph.

In the example shown in Figure 1, a person entered the building in the early_morning between 7AM and 8AM. The person sent 2 network messages and then moved into the classified area in the morning between 8AM and 9AM. The person then left the classified area in the morning between 9AM and 10AM.

A graph input file for the GBAD system is an ASCII text file that defines the vertices with sequential numbers, and defines edges using these numbers to specify a connection between two vertices. Using a Python script that converts the proxLog.csv and IPLog3.5.csv files provided with the mini-challenge, we generated graph input files that matched the topology described above. An example (partial) graph input file, created using this method, looks like the following:

v 1 location

v 2 classified

e 1 2 location_type

…
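As a minimal sketch of how such a file could be written, the following Python function serializes a list of vertex labels and a list of edges into the `v` and `e` line format shown in the excerpt above. The function name `format_graph` and its argument layout are assumptions made for illustration; this is not the conversion script used for the submission.

```python
def format_graph(vertices, edges):
    """Serialize a graph into 'v <id> <label>' and 'e <src> <dst> <label>' lines.

    vertices: list of labels; vertex ids are assigned sequentially from 1.
    edges: list of (source_id, target_id, label) tuples.
    """
    lines = [f"v {vertex_id} {label}"
             for vertex_id, label in enumerate(vertices, start=1)]
    lines += [f"e {source} {target} {label}"
              for source, target, label in edges]
    return "\n".join(lines) + "\n"


# Reproduces the three lines of the excerpt above.
text = format_graph(["location", "classified"], [(1, 2, "location_type")])
```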

GBAD is a command-line program that can run on multiple operating systems (Linux, Windows, etc.). Once the graph files are created, GBAD is executed on each graph input file, returning (1) the normative pattern discovered in the specified graph input file, and (2) the top-N most anomalous patterns, where N is set to 1 by default. The graph input file and discovered patterns can be converted to the dot format and visualized in GraphViz.
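The conversion to dot format can be illustrated with a short Python sketch. It assumes the input contains only the `v` and `e` lines shown earlier and writes every edge as a directed edge; `graph_to_dot` is a hypothetical helper written for this illustration, not a utility that ships with GBAD.

```python
def graph_to_dot(graph_text, name="gbad_graph"):
    """Convert 'v'/'e' graph lines into a GraphViz dot description."""
    lines = [f"digraph {name} {{"]
    for raw_line in graph_text.splitlines():
        parts = raw_line.split()
        if len(parts) >= 3 and parts[0] == "v":
            label = " ".join(parts[2:])
            lines.append(f'  {parts[1]} [label="{label}"];')
        elif len(parts) >= 4 and parts[0] == "e":
            label = " ".join(parts[3:])
            lines.append(f'  {parts[1]} -> {parts[2]} [label="{label}"];')
        # Any other line (blank, comment, elided content) is skipped.
    lines.append("}")
    return "\n".join(lines)


dot = graph_to_dot("v 1 location\nv 2 classified\ne 1 2 location_type\n")
```

The resulting text can be saved to a file and rendered with a GraphViz layout program such as `dot`.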

We initially created one graph of all employee activity for all days. From that graph, we were able to discover the normative pattern for all employees across all days. Figure 2 shows a visualization of the normative pattern.

Figure 2. Normative pattern.

After uncovering the normative pattern, GBAD uses three algorithms to discover all of the possible structural changes that can exist in a graph (i.e., modifications, deletions, and insertions). Both the discovery of the normative pattern and the discovery of the anomalies are performed automatically in a single run of GBAD.

To determine which employee was the insider threat, we manually ranked our observations according to which employees showed anomalies in the following attributes:

· Piggybacking

· Movement

· Network activity

· Time of day

Based upon these criteria, we suspected that employee number 38 was involved because of patterns of behavior such as:

· Exiting the classified area with no record of entry (i.e., piggybacking).

· Sending network traffic from inside the building with no record of having entered.

· Weekend activity.

· Large number of network connections.

· Activity at unusual times of the day.
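The first of these patterns, movement with a missing badge record, can also be checked directly against the proximity log. The Python sketch below is a simple state-tracking check, not the GBAD algorithm and not the method used to produce our answer. It assumes the log has been reduced to time-ordered (employee, event) pairs and that the event types are named `prox-in-building`, `prox-in-classified` and `prox-out-classified`; both the function name `find_piggybacking` and these event names are assumptions.

```python
def find_piggybacking(events):
    """Flag badge events that imply an unrecorded (piggybacked) movement.

    events: time-ordered list of (employee_id, event_type) pairs.
    Returns a list of (event_index, employee_id, reason) tuples.
    """
    in_classified = {}  # employee_id -> True if last recorded inside classified
    findings = []
    for index, (employee, event) in enumerate(events):
        inside = in_classified.get(employee, False)
        if event == "prox-in-classified":
            if inside:
                findings.append(
                    (index, employee, "entered classified without a recorded exit"))
            in_classified[employee] = True
        elif event == "prox-out-classified":
            if not inside:
                findings.append(
                    (index, employee, "exited classified without a recorded entry"))
            in_classified[employee] = False
        elif event == "prox-in-building":
            if inside:
                findings.append(
                    (index, employee, "badged into building while recorded in classified"))
            in_classified[employee] = False
    return findings


# Hypothetical events for two made-up employee ids (1 and 2).
sample_events = [
    (1, "prox-in-building"),
    (1, "prox-out-classified"),   # no matching entry record
    (2, "prox-in-building"),
    (2, "prox-in-classified"),
    (2, "prox-out-classified"),
]
findings = find_piggybacking(sample_events)
```

A check like this only covers badge inconsistencies; the remaining patterns in the list depend on network activity and time of day, which is where the structural comparison against the normative pattern is useful.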

Figure 3 shows an example of one of the anomalous instances reported by GBAD for this employee.

Figure 3. Example of unusual movement by employee at an abnormal time for that employee.

Some other interesting observations we made about other employee behavior were:

· Employee 12 likes working late – sometimes close to midnight.

· Employee 8 exits the building in the middle of the day, after making a network connection, and returns later in the day.

· Employee 26 moves around the facility significantly more than other employees.

GBAD can be used to detect anomalies of possible insider threat activity in a graph representation of data that captures relational information. While we use GraphViz to visualize the graph patterns, any graph visualization tool could be used. The main point is that the ability to discover normative patterns and anomalies is critical to the visual detection of insider threat activity in the data.